(Here is the original Reddit comment announcing this collection of data and what the processes were.) This is an archive of Reddit comments from October of 2007 until May of 2015 (complete month). This reflects 14 months of work and a lot of API calls. This dataset includes nearly every publicly available Reddit comment. Approximately 350,000 comments out of ~1.65 billion were unavailable due to Reddit API issues. Q: How are the files structured? Each file is compressed with bzip2 compression....
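As a rough illustration of working with bzip2-compressed files like these, the sketch below streams records out of one file without decompressing it to disk first. It assumes each file holds one JSON object per line, which the description above does not state; the function names and the demo records are hypothetical.

```python
import bz2
import json
import os
import tempfile


def iter_comments(path):
    """Yield one decoded record per non-empty line of a bzip2 file.

    Assumption: the file is newline-delimited JSON (one object per line).
    """
    with bz2.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def write_demo_file(path, records):
    """Write hypothetical records as bzip2-compressed, newline-delimited JSON."""
    with bz2.open(path, "wt", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")


if __name__ == "__main__":
    # Made-up records used only to exercise the reader.
    demo = [{"id": "c1", "body": "first"}, {"id": "c2", "body": "second"}]
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.bz2")
    write_demo_file(demo_path, demo)
    for comment in iter_comments(demo_path):
        print(comment["id"], comment["body"])
```

Streaming line by line keeps memory use flat, which matters for an archive of this size.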
Dark Net Markets (DNM) are online markets typically hosted as Tor hidden services whose users transact in Bitcoin or other cryptocoins, usually for drugs or other illegal/regulated goods; the most famous DNM was Silk Road 1, which pioneered the business model. From 2013-2015, I scraped/mirrored on a weekly or daily basis all existing English-language DNMs as part of my research into their usage, lifetimes/characteristics, & legal riskiness; in addition, I made or obtained copies of as many... Topics: Tor, Bitcoin, drugs, Silk Road, Evolution, Agora, black-markets, dark net markets
Culled from various sources, this collection includes over one million JPG, PNG and GIF album covers. The resolution ranges from "thumbnail" through to very large sizes. Filenames vary in usefulness, although a good number indicate at least the name of the original album. This dataset is for experimentation and image processing research only. At 148 GB, the collection is large but not unmanageable (there is a torrent available) and allows a developer or artist to work with the... Topics: dataset, big data, album covers, covers, cover art, cover photos
I took the Reddit comment archive and converted all the JSON into one SQLite database using this program that I wrote: https://gist.github.com/ers35/3b615a75fa0ed5e6d5cc I ran a few tests to make sure the number of database rows matches the number of JSON records. "SELECT MAX(rowid) FROM comment" and "SELECT COUNT(id) FROM comment" both return 1659361605. This gives me some confidence as to the integrity of the dataset, but I cannot be 100% sure. The compressed size is 163G....
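The two queries quoted above can be run from Python's standard `sqlite3` module. This is only a sketch of that check, not the author's program: the table name `comment` and the column `id` come from the quoted queries, while the `body` column, the helper names, and the in-memory demo rows are hypothetical.

```python
import sqlite3


def row_counts_match(conn: sqlite3.Connection) -> bool:
    """Return True when MAX(rowid) equals COUNT(id) in the comment table."""
    max_rowid = conn.execute("SELECT MAX(rowid) FROM comment").fetchone()[0]
    count_id = conn.execute("SELECT COUNT(id) FROM comment").fetchone()[0]
    return max_rowid == count_id


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory stand-in for the real database (made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE comment (id TEXT, body TEXT)")
    conn.executemany(
        "INSERT INTO comment (id, body) VALUES (?, ?)",
        [("c1", "first"), ("c2", "second"), ("c3", "third")],
    )
    return conn


if __name__ == "__main__":
    conn = make_demo_db()
    print(row_counts_match(conn))
```

The two numbers agree only when rowids run contiguously from 1 and no `id` is NULL, so a match suggests no rows were skipped or lost during import. It says nothing about whether the content of each row is correct, which fits the author's caveat about not being 100% sure.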
FOIA/FOILed Taxi Trip Data from the NYC Taxi and Limousine Commission 2013. Released by http://chriswhong.com/open-data/foil_nyc_taxi/ trip_data.7z and trip_fare.7z are more efficiently compressed versions of the data; you probably want these files. The data is in CSV format. For the data files this includes the fields: medallion, hack_license, vendor_id, rate_code, store_and_fwd_flag, pickup_datetime, dropoff_datetime, passenger_count, trip_time_in_secs, trip_distance, pickup_longitude,... Topics: data, nyc, taxi, fare, csv, FOIA, FOIL Source: torrent:urn:sha1:6c594866904494b06aae51ad97ec7f985059b135
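A minimal sketch of reading such a CSV with Python's standard `csv` module follows. It uses only the field names listed above (the list is truncated in the description, so the real files have more columns), and the sample rows are invented for illustration. It also assumes the header row uses exactly these names.

```python
import csv
import io

# Field names as listed in the description; the list there is truncated,
# so this is not the full set of columns in the real files.
LISTED_FIELDS = [
    "medallion", "hack_license", "vendor_id", "rate_code",
    "store_and_fwd_flag", "pickup_datetime", "dropoff_datetime",
    "passenger_count", "trip_time_in_secs", "trip_distance",
    "pickup_longitude",
]

# Hypothetical rows, made up purely to exercise the parser.
SAMPLE = (
    ",".join(LISTED_FIELDS) + "\n"
    "M1,H1,V1,1,N,2013-01-01 00:00:00,2013-01-01 00:05:00,1,300,1.5,-73.98\n"
    "M2,H2,V1,1,N,2013-01-01 00:10:00,2013-01-01 00:20:00,2,600,2.5,-73.97\n"
)


def mean_trip_distance(csv_file) -> float:
    """Average the trip_distance column of an open CSV file object."""
    total = 0.0
    count = 0
    for row in csv.DictReader(csv_file):
        total += float(row["trip_distance"])
        count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    print(mean_trip_distance(io.StringIO(SAMPLE)))
```

For the real files, the same function could be passed a handle from `open(path, newline="")` once the 7z archive is extracted.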
CITY OF SCREENSHOTS is a dataset of automatically screenshotted computer programs taken from the uploaded items in the Internet Archive's software collection. An ever-growing dataset, it currently contains over 450,000 individual screenshot images from over 60,000 software programs for the Apple II, Atari 8-Bit, MS-DOS, Windows, ZX Spectrum and Tandy Color Computer. Screenshots were generated by an automated script: an instance of Firefox would be spun up and aimed at the emulated item on the... Topics: screenshots, computers, screens, artwork
This is a complete copy of the GeoNames database dumps fetched on 2017-06-13 from http://www.geonames.org/export/. It does not include any postal code information. See the readme.txt file for more information. "The GeoNames geographical database covers all countries and contains over eleven million placenames that are available for download free of charge."
This is the monthly dump of DISCOGS.ORG data, provided to the public domain. The data is in XML format and formatted according to the API spec: http://www.discogs.com/developers/. This dump has been generated and archived automatically. Official name is discogs_20080309 and the data is from 2008-03-09.
This is a mirror of a CiteSeerX database dump, downloaded from S3. It's hosted here for easy Internet Archive analytics access, and so we don't need to re-pay S3 download fees. See also: http://csxstatic.ist.psu.edu/about/data
This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation. The email dataset was later purchased by Leslie Kaelbling at MIT, and turned out to have a number of... Topics: Enron, E-mail, Dataset
This dataset was mirrored from http://220.127.116.11/WebUI/CatDatabase/catData.html, which circa May 2017 is a dead link. The original page is available in Wayback: https://web.archive.org/web/20150520175645/http://18.104.22.168/WebUI/CatDatabase/catData.html The CAT dataset includes 10,000 cat images. For each image, we annotate the head of the cat with nine points: two for the eyes, one for the mouth, and six for the ears. The detailed configuration of the annotation was shown in Figure 6 of the original paper:... Topics: cats, datasets, computer vision